-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Heartbeat] Fix summarizer #36519
[Heartbeat] Fix summarizer #36519
Conversation
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
|
@@ -85,7 +85,9 @@ func TestSockTracerWaitFail(t *testing.T) { | |||
started := time.Now() | |||
_, err := NewSockTracer(filepath.Join(os.TempDir(), "garbagenonsegarbagenooonseeense"), waitFor) | |||
require.Error(t, err) | |||
require.GreaterOrEqual(t, time.Now(), started.Add(waitFor)) | |||
// Compare unix millis because things get a little weird with nanos |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to the changes here, just noticed this flakey test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found bugs with retrying logic running with max_attempts: 2
.
heartbeat.run_once: true
heartbeat.monitors:
- type: browser
id: test-hb
enabled: true
name: Test HB dev
schedule: '@every 1m'
screenshots: "off"
max_attempts: 2
source:
inline:
script: |-
step("load homepage", async ) => {
});
- Attempts for failed jobs is calculated incorrectly. Check the image and the appropriate debug message
// even for the first attempt
{"log.level":"debug","@timestamp":"2023-09-13T17:05:47.497-0700","log.origin":{"file.name":"summarizer/plugstatestat.go","file.line":175},"message":"attempt info: == down && 2 < 2","service.name":"heartbeat","ecs.version":"1.6.0"}
@@ -46,8 +45,6 @@ func (senr *streamEnricher) enrich(event *beat.Event, se *SynthEvent) error { | |||
type journeyEnricher struct { | |||
journeyComplete bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't think this is an actual issue however. Adding data shouldn't be an issue, modifying or removing it is a different story however.
It wont break anything, but its just unnecessary adding to the storage cost. We were using those on the HB side to measure the step count and such, but not so useful on the UI layer.
} | ||
|
||
func (bwdsp *BrowserDurationPlugin) BeforeSummary(event *beat.Event) BeforeSummaryActions { | ||
if bwdsp.startedAt == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
our previous logic which is correct behaviour never sets mon.duration for these cases.
// In case of syntax errors or incorrect runner options, the Synthetics
// runner would exit immediately with exitCode 1 and we do not set the duration
// to inform the journey never ran
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing enrich tests makes these cases easier to escape.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This breaks the error logger which always expects a duration. Generally I've found that occasionally not having a field leads to more breakage. I've changed the logic however to just set the duration to zero in these cases. If you have any concerns here LMK.
@@ -46,8 +45,6 @@ func (senr *streamEnricher) enrich(event *beat.Event, se *SynthEvent) error { | |||
type journeyEnricher struct { | |||
journeyComplete bool | |||
journey *Journey | |||
errorCount int | |||
error error | |||
stepCount int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the journeyEnricher code is dead code, we can safely remove most of the properties like start, end, stepCount, urlFields, journeyComplete
etc as is already covered in summarizer plugins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great call, will remove
}) | ||
|
||
// Add step count meta for log wrapper | ||
eventext.SetMeta(event, logger.META_STEP_COUNT, je.stepCount) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing this has effect on the Logger where the steps would never be logged. we need this logic to be on summarizer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ bringing back to plugerr.go where we count steps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except the duration stuff, I am running some small tests before approving the PR around hook errors and other browser checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, Except for the case where there are repetition on the error
field for the journey/end
event.
- Synthetics agent generated a after error for a journey
{"type":"journey/end","@timestamp":1695066934081319.8,"journey":{"name":"testing timeout issue - regular","id":"testing timeout issue - regular","status":"failed"},"root_fields":{"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"0.0.0-development"}},"payload":{"start":1300805.898640458,"end":1300806.2263515829,"status":"failed"},"error":{"name":"Error","message":"after Error","stack":"Error: after Error\n at /Users/vigneshh/elastic/synthetics/synth-init/journeys/debug-timeout.journey.ts:30:11\n at /Users/vigneshh/elastic/synthetics/src/helpers.ts:117:40\n at Array.map (<anonymous>)\n at runParallel (/Users/vigneshh/elastic/synthetics/src/helpers.ts:117:30)\n at Runner.runAfterHook (/Users/vigneshh/elastic/synthetics/src/core/runner.ts:180:22)\n at Runner.runJourney (/Users/vigneshh/elastic/synthetics/src/core/runner.ts:369:18)\n at Runner.run (/Users/vigneshh/elastic/synthetics/src/core/runner.ts:452:11)\n at Command.<anonymous> (/Users/vigneshh/elastic/synthetics/src/cli.ts:138:23)"},"package_version":"0.0.0-development"}
- HB produced duplicate error fields, one inside
synthetics.error
anderror
at top level.
{"@timestamp":"2023-09-18T19:41:27.009Z","@metadata":{"beat":"heartbeat","type":"_doc","version":"8.11.0","raw_index":"synthetics-browser-default"},"agent":{"ephemeral_id":"5c814a3d-05f4-4d70-be1a-7eeebce7ff1a","id":"095d58ac-9a94-4709-b918-b07cfcdb4228","name":"Vigneshs-MacBook-Pro.local","type":"heartbeat","version":"8.11.0"},"ecs":{"version":"8.0.0"},"monitor":{"check_group":"5a49b0e6-565b-11ee-9e45-b2772eebffba-1","name":"Improved Stack","id":"improved-stack","timespan":{"lt":"2023-09-18T19:42:27.397Z","gte":"2023-09-18T19:41:27.397Z"},"type":"browser"},"synthetics":{"package_version":"0.0.0-development","index":154,"payload":{"end":1299959.1694570831,"status":"failed","start":1299958.5494264159},"journey":{"name":"testing timeout issue - regular","id":"testing timeout issue - regular"},"error":{"name":"Error","message":"after Error","stack":"Error: after Error\n at /private/var/folders/_h/tz2qn0gx7gd82wxpft62zhwh0000gn/T/elastic-synthetics-unzip-150742446/journeys/journeys/debug-timeout.journey.ts:31:11\n at /Users/vigneshh/elastic/synthetics/src/helpers.ts:117:40\n at Array.map (<anonymous>)\n at runParallel (/Users/vigneshh/elastic/synthetics/src/helpers.ts:117:30)\n at Runner.runAfterHook (/Users/vigneshh/elastic/synthetics/src/core/runner.ts:180:22)\n at Runner.runJourney (/Users/vigneshh/elastic/synthetics/src/core/runner.ts:369:18)\n at Runner.run (/Users/vigneshh/elastic/synthetics/src/core/runner.ts:452:11)\n at Command.<anonymous> (/Users/vigneshh/elastic/synthetics/src/cli.ts:138:23)"},"type":"journey/end"},"os":{"platform":"darwin"},"package":{"version":"0.0.0-development","name":"@elastic/synthetics"},"data_stream":{"namespace":"default","dataset":"browser","type":"synthetics"},"observer":{"ip":["fe80::b077:2eff:feeb:ffba","fe80::b077:2eff:feeb:ffb9","fe80::b077:2eff:feeb:ffb8","fe80::f84d:89ff:fe6b:ff07","192.168.0.24","fe80::5c7e:31ff:feed:b57d","fe80::5c7e:31ff:feed:b57d","fe80::6358:d49d:58e3:86be","fe80::d572:c882:417:aa46","fe80::ce81:b1c:bd2c:69e"],"mac":["36-C3-70-61-2B-00","36-C3-70-61-2B-04","36-C3-70-61-2B-08","5E-7E-31-ED-B5-7D","B2-77-2E-EB-FF-98","B2-77-2E-EB-FF-99","B2-77-2E-EB-FF-9A","B2-77-2E-EB-FF-B8","B2-77-2E-EB-FF-B9","B2-77-2E-EB-FF-BA","F8-4D-89-6B-FF-07","FA-4D-89-6B-FF-07"],"hostname":"vigneshs-macbook-pro.local"},"error":{"type":"Error","code":"","message":"error executing step: after Error","stack_trace":"Error: after Error\n at /private/var/folders/_h/tz2qn0gx7gd82wxpft62zhwh0000gn/T/elastic-synthetics-unzip-150742446/journeys/journeys/debug-timeout.journey.ts:31:11\n at /Users/vigneshh/elastic/synthetics/src/helpers.ts:117:40\n at Array.map (<anonymous>)\n at runParallel (/Users/vigneshh/elastic/synthetics/src/helpers.ts:117:30)\n at Runner.runAfterHook (/Users/vigneshh/elastic/synthetics/src/core/runner.ts:180:22)\n at Runner.runJourney (/Users/vigneshh/elastic/synthetics/src/core/runner.ts:369:18)\n at Runner.run (/Users/vigneshh/elastic/synthetics/src/core/runner.ts:452:11)\n at Command.<anonymous> (/Users/vigneshh/elastic/synthetics/src/cli.ts:138:23)"},"event":{"type":"journey/end","dataset":"browser"}}
- HB summary contains error which is correct behaviour.
Its not a blocker, but good to know why its happening.
journey('after error', ({ page }) => {
after(() => {
throw new Error('after Error');
});
step('load example', async () => {
await page.goto('https://example.com/home');
});
});
* Cleanup summarizer code * Separate concerns in summarizer * Checkpoint * Fix failing tests * FMT * Tweaks * Make linter happy * progress * cleanup docs * Bring back wrappers tests (partial) * Restore wrapper tests * Fix failing tests * Fix err handling * Re-init plugins on retry * Fix error field handling across retries * Incorporate PR feedback * Type fix * URLs now work, tests passing * Improved err handling * Test fixes * Cleanup naming * Fix handling of step counts / journey/end missing and also fix continuity test * Fix failing tests around logging / logging behavior * Rename OnRetry to BeforeRetry * Move monitor.status calculation for browsers into summarizer * Cleanup status logic * More status consolidation * Fixed failing tests * Make monitor logger errors more understandable * Fix retry delay * Fix retry delay * Remove spurious 'wrapped:' in logs * Incorporate pr feedback * Fix dur * Fix cmd status * Fix tests * Fmt * Integrate PR feedback --------- Co-authored-by: Vignesh Shanmugam <[email protected]>
) [Heartbeat] Fix missing monitor.status value in initial attempt where max_attempts > 2. Introduced in elastic#36623 adding tests to the scenario runner as well. Original cause was this PR elastic#36519 that did not produce the correct monitor.status: down when the monitor is retried with the second attempt.
WIP fixes for summary duration. The changes in #36147 broke the summaries for browser monitors, causing the
monitor.duration.us
field to be attached to thejourney/end
event, and for a final summary event to be emitted that was missing this.monitor.duration.us
must be present for all summary events.The fix here was somewhat complex because prior to this PR much of the summary field generation was separate for browser vs. lightweight monitors. This PR unifies much of this logic into the new summarizer, thus trimming some logic from
synthexec
.An additional change is the removal of a number of tests in
wrappers
. There are a lot of levels of abstraction in the heartbeat processing pipeline, and the wrappers level is tricky in a number of ways, primarily because it only includes some of the fields rendered. Going forward an emphasis on testing in thescenarios
package and lower level unit tests (such as the additional assertions added to the summarizer tests) makes more sense in terms of both maintainability and accuracy.Proposed commit message
Checklist
- [ ] I have made corresponding changes to the documentation- [ ] I have made corresponding change to the default configuration files- [ ] I have added an entry inCHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.How to test this PR locally
Check that browser monitors always emit a
monitor.duration.us
field with summaries, as do lightweight monitors.